This is an R Markdown document, using a Tufte Handouts style template. R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com. And as we happily use the Tufte Handouts format, please consider the corresponding documentation: https://rstudio.github.io/tufte/.
We prefer to work with the tidytext package, which is somewhat more intuitive and modern than the suggested tm package.
First, we load our data with the readr::read_lines function, converting each file into a tibble on the fly:
library(tidyverse)
library(tidytext)

# unzip("Coursera-SwiftKey.zip")
US_twitter <- as_tibble(read_lines("final/en_US/en_US.twitter.txt"))
US_news <- as_tibble(read_lines("final/en_US/en_US.news.txt"))
US_blogs <- as_tibble(read_lines("final/en_US/en_US.blogs.txt"))
US_total <- bind_rows(US_twitter, US_news, US_blogs)
And just for the sake of it, here are the first lines of each document:

| Twitter | |
|---|---|
| 1 | How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. |
| 2 | When you meet someone special… you’ll know. Your heart will beat more rapidly and you’ll smile for no reason. |
| 3 | they’ve decided its more fun if I don’t. |
| 4 | So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;) |
| 5 | Words from a complete stranger! Made my birthday even better :) |
| 6 | First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go! |
| 7 | i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up…damn thing |
| 8 | I’m coo… Jus at work hella tired r u ever in cali |
| 9 | The new sundrop commercial …hehe love at first sight |
| 10 | we need to reconnect THIS WEEK |
| News | |
|---|---|
| 1 | He wasn’t home alone, apparently. |
| 2 | The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s. |
| 3 | WSU’s plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building. |
| 4 | The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15. |
| 5 | And when it’s often difficult to predict a law’s impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse? |
| Blogs | |
|---|---|
| 1 | In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. |
| 2 | We love you Mr. Brown. |
| 3 | Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him. |
| 4 | so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home. |
| 5 | With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one’s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE’s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy! |
| Source | #Lines | #Words | #UniqueWords |
|---|---|---|---|
| Twitter | 2360148 | 30093372 | 370388 |
| News | 1010242 | 34762395 | 284533 |
| Blogs | 899288 | 37546239 | 320003 |
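The counts in the table above can be reproduced with tidytext itself. A minimal sketch on a toy two-line corpus, where sample_corpus is a hypothetical stand-in for one of the real corpora:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for US_twitter / US_news / US_blogs
sample_corpus <- tibble(value = c("How are you?",
                                  "We need to reconnect THIS WEEK"))

corpus_stats <- sample_corpus %>%
  unnest_tokens(word, value) %>%          # one row per word, lowercased
  summarise(Lines = nrow(sample_corpus),  # number of input lines
            Words = n(),                  # total word count
            UniqueWords = n_distinct(word))
```

Run on the full US_twitter, US_news, and US_blogs tibbles, this yields the three rows of the table.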
Now the tidytext package becomes pretty handy, as we can easily convert our lines of text into a tidy one-word-per-row table:
US_token <- US_total %>%
  unnest_tokens(word, value)
| Each Word from the full corpus (1 to 10) | |
|---|---|
| 1 | how |
| 2 | are |
| 3 | you |
| 4 | btw |
| 5 | thanks |
| 6 | for |
| 7 | the |
| 8 | rt |
| 9 | you |
| 10 | gonna |
Since I wouldn’t be able to guess the role of a profanity or a foreign word in a sentence, I’d better drop all sequences of words (the corresponding lines in my data set) that contain one.
First, let’s build a profanity and foreign-language dictionary:
We bundle together several profanity lists from the lexicon package:
library(lexicon)

profanity_full <- unique(tolower(c(profanity_alvarez,
                                   profanity_arr_bad,
                                   profanity_banned,
                                   profanity_racist,
                                   profanity_zac_anger)))
| word | |
|---|---|
| 1 | *damn |
| 2 | *dyke |
| 3 | fuck |
| 4 | shit |
| 5 | @$$ |
| 6 | ahole |
| 7 | amcik |
| 8 | andskota |
| 9 | anus |
| 10 | arschloch |
Oh my, that’s bad indeed!
And I also have lists of common words in other languages at my disposal: I use the website https://1000mostcommonwords.com to access common-word lists for French and Spanish.
library(stringr)

# French Words
Words.fr <- as_tibble(read.csv("french-word-list-total_vMMi.csv"))
Words.fr2 <- Words.fr[1:50, ]
# I take only a small sample, as it gets messy fast
# due to lexical ambiguities
Only_fr <- Words.fr2[!Words.fr2$word %in% sw_fry_1000, ]$word
Only_fr <- Only_fr[!(str_detect(Only_fr, "[A-Z]+") |
                     str_detect(Only_fr, "\\?") |
                     str_detect(Only_fr, "pour"))]
# e.g. "pour" is ambiguous: it means to/for in French!

# Spanish Words
Words.sp <- as_tibble(read.csv("spanish-word-list-total_vMMi.csv"))
Words.sp2 <- Words.sp[1:50, ]
Only_sp <- Words.sp2[!Words.sp2$word %in% sw_fry_1000, ]$word
Only_sp <- Only_sp[!(str_detect(Only_sp, "[A-Z]+") |
                     str_detect(Only_sp, "\\?"))]
It could be worthwhile to repeat the exercise for more languages, or to devise a more sophisticated approach. Wikipedia has great entries on words ranked by frequency in different languages; it’s apparently a rich topic.
Still, I can now build my ‘clean’ corpus:
CleanCheck <- as_tibble(unique(c(Only_fr, Only_sp, profanity_full)))
names(CleanCheck) <- "word"

US_tokenClean <- US_token %>%
  anti_join(CleanCheck, by = "word")
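As a sanity check, anti_join should leave none of the flagged words in the cleaned tokens. A self-contained sketch, with a toy token table and block list standing in for US_token and CleanCheck:

```r
library(dplyr)
library(tibble)

# Toy stand-ins for US_token and CleanCheck
tokens_demo <- tibble(word = c("how", "are", "you", "merde"))
blocklist <- tibble(word = "merde")

# anti_join keeps only the rows with no match in the block list
tokens_clean <- tokens_demo %>%
  anti_join(blocklist, by = "word")
```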
First, a classic word-cloud visualization for all words in the corpora, as well as for the non-trivial words only (just FYI, this doesn’t particularly help the forecasting tool).
And below, the histogram of the 50 most frequent words across 30 samples:
This plot shows the approximate frequencies of the most common words, which together cover around 37% of the full corpus.
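The coverage figure can be computed by cumulating word shares after count(). A self-contained sketch on a toy token table standing in for the cleaned corpus:

```r
library(dplyr)
library(tibble)

# Toy stand-in for US_tokenClean
tokens_demo <- tibble(word = c("the", "the", "the", "cat",
                               "sat", "on", "the", "mat"))

coverage <- tokens_demo %>%
  count(word, sort = TRUE) %>%
  mutate(share = n / sum(n),         # relative frequency of each word
         cum_share = cumsum(share))  # coverage of the top-k words
```

On the real corpus, the row where cum_share first exceeds 0.37 tells you how many distinct words are needed to cover 37% of all tokens.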
I suggest here some measures to consider or implement further on in the project.
Using a synonyms dictionary might help reduce the problem’s complexity.
As observed earlier with foreign languages, word-frequency ranking is a well-documented topic with sophisticated methods attached to it. Checking for common words that might be missing from the corpora could help too.
The full corpus (the sum of the three corpora), once tokenized into single words or bigrams, is a very large vector, so sampling is required to work with it. We will need to think about sample size to ensure fast yet accurate predictions.
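One way to draw such a sample is dplyr’s slice_sample; here is a minimal sketch, where the 10% proportion is an arbitrary placeholder to be tuned against speed and accuracy:

```r
library(dplyr)
library(tibble)

set.seed(123)                        # reproducible sample
US_total_demo <- tibble(value = paste("line", 1:1000))

US_totalSamp_demo <- US_total_demo %>%
  slice_sample(prop = 0.10)          # keep 10% of the lines at random
```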
A big upcoming step is to train the creation of a chain from any word to the next and to store it efficiently.
Here I code a simple function that forecasts the next word from the existing corpora (US_totalSamp being a sample of US_total, as discussed above):
US_bigrams <- US_totalSamp %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2)

nextW <- function(a) {
  # keep only the bigrams that start with the input word
  pattern0 <- paste(a, "\\S", sep = " ")
  A <- str_subset(US_bigrams$bigram, pattern0)
  # extract the word following the input word
  pattern <- paste("(?<=", a, "\\s)(\\w)+", sep = "")
  as_tibble(str_extract(A, pattern)) %>%
    count(value) %>%
    arrange(desc(n))
}
This still has to be adapted for the case where the input is nowhere to be found. Alternative versions with different inputs could also be added (an incomplete word, a chain of two words…).
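One way to handle the unseen-input case is a simple back-off to a default word. A self-contained sketch, with a toy bigram vector standing in for US_bigrams$bigram; the fallback word "the" is an arbitrary placeholder (e.g. the most frequent unigram in the corpus):

```r
library(stringr)

# Toy stand-in for US_bigrams$bigram
bigrams_demo <- c("the cat", "the dog", "the cat", "a dog")

nextW_safe <- function(a, fallback = "the") {
  # extract the word following the input word at the start of a bigram
  pattern <- paste0("(?<=^", a, "\\s)\\w+")
  hits <- str_extract(bigrams_demo, pattern)
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) return(fallback)  # back off when the word is unseen
  # otherwise return the most frequent continuation
  names(sort(table(hits), decreasing = TRUE))[1]
}
```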
Then it must be implemented in a reactive Shiny App!
Below is a draft of such an application:
shinyApp(
  ui = fluidPage(
    titlePanel("Word Forecast"),
    textInput("word1", "First Word:",
              value = "the"),
    submitButton("What's next?"),
    textOutput("word2")
  ),
  server = function(input, output) {
    output$word2 <- renderText({
      as.character(nextW(input$word1)[1, 1])
    })
  }
)
And you can find the app running here: https://mathieu-michel.shinyapps.io/wordforecast/
All previous steps will likely require weighing speed against accuracy and choosing a trade-off.